Data generation and processing

Sequencing protocols, best practice variant calling and filtering

Per Unneberg

NBIS

11/22/22

Data generation

Population genomics - the data

Population genomics analyzes genetic variation across a set of individuals, so data generation amounts to compiling variation data from each of them. The focus here is on next-generation sequencing data.


Lou et al. (2021)

RADSeq

High coverage sequencing data

Pros: call genotypes with confidence

Cons: cost

Low coverage sequencing data

Low-coverage whole genome sequencing (lcWGS) - alleviates the cost issue
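The trade-off between high and low coverage can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a diploid heterozygous site, error-free reads, and that each read samples either allele with probability 0.5. The function name `prob_both_alleles_seen` is hypothetical. Under these assumptions, the probability that n reads show both alleles is 1 - 2 * 0.5^n.

```python
def prob_both_alleles_seen(depth: int) -> float:
    """Probability that `depth` reads at a heterozygous site cover both alleles.

    Assumes error-free reads that each sample one of the two alleles
    with probability 0.5 (a simplification, not a full error model).
    """
    if depth < 2:
        return 0.0  # zero or one read can never show both alleles
    # The only failures are "all reads show allele A" or "all show allele B".
    return 1.0 - 2 * 0.5 ** depth


# Compare a few sequencing depths, from low to high coverage.
for depth in (1, 2, 4, 10, 30):
    print(f"{depth:>2}x: {prob_both_alleles_seen(depth):.3f}")
```

At 2x only half of the heterozygous sites would show both alleles under this simple model, which is one reason lcWGS analyses typically work with genotype likelihoods rather than hard genotype calls (Lou et al., 2021).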

Read mapping and variant calling

GATK best practice

Point out: not optimal for

Alternative variant callers

freebayes

bcftools

ANGSD

PoolSeq

PoPoolation

Variant filtering

General filters

Table 1: Key data filters (Table 3 Lou et al., 2021, p. 5974)
| Category | Filter | Recommendation (examples) |
|---|---|---|
| General filters | Base quality | Recalibrate / <Q20 |
| | Mapping quality | MAQ < 20 / improper pairs |
| | Minimum depth and/or number of individuals | Varies; e.g. <50% individuals, <0.8X average depth |
| | Maximum depth | 1-2 sd above median depth |
| | Duplicate reads | Remove |
| | Indels | Realign reads / haplotype-based caller / exclude bases flanking indels |
| | Overlapping sections of paired-end reads | Soft-clip to avoid double-counting |
| Filters on polymorphic sites | \(p\)-value | \(10^{-6}\) |
| | SNPs with more than two alleles | Filter; methods often assume bi-allelic sites |
| | Minimum minor allele frequency (MAF) | 1%-10% for some analyses (PCA/admixture/LD/\(\mathsf{F_{ST}}\)) |
| Restricting analysis to a predefined site list | List of global SNPs | Use global call set for analyses requiring shared sites |
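Several of the filters in Table 1 can be sketched in a few lines of Python. This is a toy illustration, not a replacement for a dedicated tool: the site records, field names (`depth`, `n_ind`, `freqs`), and the function `passes_filters` are hypothetical, and the thresholds (2 sd above median depth, at least 50% of individuals, MAF of at least 5%) are example values picked from the ranges in the table.

```python
from statistics import median, pstdev

# Hypothetical toy data: total read depth per site, number of individuals
# with data, and estimated allele frequencies.
N_INDIVIDUALS = 10
SITES = [
    {"pos": 101, "depth": 20, "n_ind": 9, "freqs": [0.75, 0.25]},
    {"pos": 202, "depth": 22, "n_ind": 4, "freqs": [0.55, 0.45]},      # too few individuals
    {"pos": 303, "depth": 95, "n_ind": 10, "freqs": [0.50, 0.50]},     # excessive depth
    {"pos": 404, "depth": 18, "n_ind": 8, "freqs": [0.60, 0.30, 0.10]},  # multi-allelic
    {"pos": 505, "depth": 21, "n_ind": 10, "freqs": [0.995, 0.005]},   # rare minor allele
    {"pos": 606, "depth": 19, "n_ind": 7, "freqs": [0.60, 0.40]},
]


def passes_filters(site, max_depth, n_individuals,
                   min_ind_fraction=0.5, min_maf=0.05):
    """Return True if a site survives all four example filters."""
    if site["depth"] > max_depth:
        return False  # maximum depth filter
    if site["n_ind"] < min_ind_fraction * n_individuals:
        return False  # minimum number of individuals
    if len(site["freqs"]) != 2:
        return False  # keep bi-allelic sites only
    return min(site["freqs"]) >= min_maf  # minimum minor allele frequency


# Maximum depth cutoff: 2 standard deviations above the median depth.
depths = [site["depth"] for site in SITES]
max_depth = median(depths) + 2 * pstdev(depths)

kept = [site["pos"] for site in SITES
        if passes_filters(site, max_depth, N_INDIVIDUALS)]
print(f"max depth cutoff: {max_depth:.1f}")
print("sites kept:", kept)
```

In this toy example only positions 101 and 606 pass; each of the other sites trips exactly one filter. In practice these filters are applied with the variant caller or a VCF toolkit, and the thresholds should be tuned to the data set at hand.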

Refs

Lou, R. N., Jacobs, A., Wilder, A. P., & Therkildsen, N. O. (2021). A beginner’s guide to low-coverage whole genome sequencing for population genomics. Molecular Ecology, 30(23), 5966–5993. https://doi.org/10.1111/mec.16077
Talla, V., Soler, L., Kawakami, T., Dincă, V., Vila, R., Friberg, M., Wiklund, C., & Backström, N. (2019). Dissecting the Effects of Selection and Mutation on Genetic Diversity in Three Wood White (Leptidea) Butterfly Species. Genome Biology and Evolution, 11(10), 2875–2886. https://doi.org/10.1093/gbe/evz212